Setup

For the lecture I used the following code to generate data:

n = 50
set.seed(101)
x1 = runif(n)
x2 = runif(n)
prob  = ifelse(x1 < 0.5 & x1 > 0.25 & x2 > 0.25 & x2<0.75,0.8,0.3)
y  = as.factor(rbinom(n,1,prob))
levels(y) = c("1","2")
df = data.frame(y=y,x1=x1,x2=x2)

And then I used the following function to make KNN predictions:

knn_pred = function(point,x1,x2,y,k=5, thresh = 0.5) {
  dists = sqrt((x1-point[1])^2+(x2-point[2])^2) #Find all distances to current obs
  bound = sort(dists)[k]                #Find kth smallest distance
  indices = which(dists <= bound)       #Find which obs have dists 1:k
  outcomes = as.integer(y[indices]) #Find corresponding outcomes y
  prob = mean(outcomes)-1
  if (prob < thresh){
    out = 1
  } else {
    out = 2
  }
  out
  
}

This code builds a grid of points, and then makes predictions for each of those points.

grid.fineness = 201
sequence = seq(0,1,length.out=grid.fineness)
grid = expand.grid(sequence,sequence)
colnames(grid) = c("x1","x2")
yhat = apply(grid,1,knn_pred,x1=x1,x2=x2,y=y,k=5)
yhat = as.factor(yhat)

With those predictions, we can build a dataframe, and plot.

df = as.data.frame(grid)
df$y = yhat
ggplot(df,aes(x=x1,y=x2,col=y))+geom_point(size=0.08)

If we drop the round and subtract 1 in our knn_pred function, we can get probabilities out.

knn_prob = function(point,x1,x2,y,k=5) {
  dists = sqrt((x1-point[1])^2+(x2-point[2])^2) #Find all distances to current obs
  bound = sort(dists)[k]                #Find kth smallest distance
  indices = which(dists <= bound)       #Find which obs have dists 1:k
  outcomes = as.integer(y[indices])     #Find corresponding outcomes y
  mean(outcomes)-1 #Taking advantage of 2 outcomes.
}

We can predict those probabilities at each point:

phat = apply(grid,1,knn_prob,x1=x1,x2=x2,y=y,k=5)
df$phat = phat
ggplot(df,aes(x=x1,y=x2,col=phat))+geom_point(size=0.08)

This is, in essence, the beginnings of a simulation study. We generated data, and we can look at how our predictions perform. We can do this with either the classifications or the underlying probabilities.

Questions

We are going to extend this simulation study in a few ways.

Q1 - Bigger Sample

Resetting the seed with:

set.seed(101)

Run the same data generation code, but with a sample size of 1000. Plot the resulting probabilities when we use \(K=1\), \(K=5\), and \(K=25\).

n = 1000
set.seed(101)
x1 = runif(n)
x2 = runif(n)
prob  = ifelse(x1 < 0.5 & x1 > 0.25 & x2 > 0.25 & x2<0.75,0.8,0.3)
y  = as.factor(rbinom(n,1,prob))
levels(y) = c("1","2")
bigdf = data.frame(y=y,x1=x1,x2=x2)
grid.fineness = 201
sequence = seq(0,1,length.out=grid.fineness)
grid = expand.grid(sequence,sequence)
colnames(grid) = c("x1","x2")
# yhat = apply(grid,1,knn_pred,x1=x1,x2=x2,y=y,k=1)
# yhat = as.factor(yhat)
bigdf = as.data.frame(grid)

# Create probabilities for each K value
bigdf$phat1 = apply(grid,1,knn_prob,x1=x1,x2=x2,y=y,k=1)
bigdf$phat5 = apply(grid,1,knn_prob,x1=x1,x2=x2,y=y,k=5)
bigdf$phat25 = apply(grid,1,knn_prob,x1=x1,x2=x2,y=y,k=25)
ggplot(bigdf,aes(x=x1,y=x2,col=phat1))+geom_point(size=0.08)+
  labs(title = 'K=1')

ggplot(bigdf,aes(x=x1,y=x2,col=phat5))+geom_point(size=0.08)+
  labs(title = 'K=5')

ggplot(bigdf,aes(x=x1,y=x2,col=phat25))+geom_point(size=0.08)+
  labs(title = 'K=25')

Plot the classification predictions when \(K=10\), using a probability threshold of 0.2 for our predictions instead of the standard 0.5.

yhat10 = apply(grid,1,knn_pred,x1=x1,x2=x2,y=y,k=10, thresh = 0.2) %>% as.factor()
bigdf$yhat10 = yhat10
ggplot(df,aes(x=x1,y=x2,col=yhat10))+geom_point(size=0.08) +
  labs(title = 'K=10',subtitle = 'Probability threshold of 0.2')

Q2 – Logit Comparison

Fit an interacted logit to this data. (i.e. model \(Y\sim x1+x2+x1:x2\) – using glm, not a LASSO). Find the predicted probabilities for every point in our grid, and plot those predicted probabilities.

gen_dat <- data.frame(y=y,x1=x1,x2=x2)
mymod <- glm(y ~ x1 + x2 + x1*x2, data = gen_dat, family = 'binomial')
preds <- predict.glm(mymod, grid, type = 'response')
ggplot(df, aes(x=x1,y=x2,col=preds))+ geom_point(size = 0.9)

Q3 – ROC

Plot the (in-sample) ROC curves for both the logit model and the KNN with \(K=10\). (hint: I have a function for doing this, given outcomes and probabilities in the lecture)

# Function for plotting ROC Curves
roc <- function(p,y, ...){
  y = factor(y) # outcomes
  n = length(p)
  p = as.vector(p) # probabilities
  Q = p > matrix(rep(seq(0,1,length=100),n),ncol=100,byrow=TRUE)
  specificity = colMeans(!Q[y==levels(y)[1],])
  sensitivity = colMeans(Q[y==levels(y)[2],])
  plot(1-specificity, sensitivity, type="l", ...)
  abline(a=0,b=1,lty=2,col=8)
  paste('Alexandria Uscasio Cortez value (kind of):', sum( (sensitivity - (1-specificity)/100) )) # Riemann sum for aoc
}
borgine <- gen_dat %>%
  select(-y)
phat10 <- apply(borgine,1,knn_prob,x1=x1,x2=x2,y=y,k=10)
roc(p=predict(mymod, type='response'), y=y, bty="n")

## [1] "Alexandria Uscasio Cortez value (kind of): 36.8884509897397"
roc(p=phat10, y=y, bty = "n")

## [1] "Alexandria Uscasio Cortez value (kind of): 48.4470483896598"
get_area <- function(sens, spec){
  
}

Which of these models looks better?

KNN model look better bc

Q4 – Survey

Please complete the surveys posted last week (and in announcements on canvas). They will help me ensure the last 5 weeks of class are as useful to you as possible.

Optional

  1. Plot the resulting predictions from a \(k=30\) KNN for Q4.
  2. Find OOS ROC curves in Q2 – build a holdout, then use predictions on it. 3. Do K-fold cross validation for the OOS ROC curves
  3. Add squared terms to the logit, remake predictions and plots (e.g. y~x1+x1^2+x2+x2^2+x1:x2)

Optional Long Q

We have a new output with three categories.

# x1,x2,y from your sample with 1k observations need to exist to run this.
set.seed(101)
z = ifelse((x1>0.8 | x2 < 0.4),rbinom(length(y),1,0.8),(y==1)*2)

Modify knn_pred so that it predicts the most likely category out of 3 categories (hint: the functions table, which.max, names, and as.integer are how I did this) (hint2: maybe start by building a function that takes a vector and finds the most common element, then fit it into the rest of this). Plot the grid-predictions with this new classifier.